Beyond Translation Memories: finding similar documents in comparable corpora
Abstract
This paper presents our most recent research in the context of TTC, an EU-funded research project, on using the Web to retrieve terminologically rich texts in a specific domain, and to find similar documents in such comparable corpora. The aim of this work is to provide tools for semi-automatic construction of bilingual term lists.

1 Parallel and comparable corpora

Re-use of existing translations lies at the centre of translation practice as we know it now. Translation Memories fuel the daily activities of translators, while Machine Translation engines build their translation models from collections of parallel texts (this applies both to Statistical MT, like Google, and to traditional rule-based MT systems enriched with statistics, like Systran). However, many more texts are produced in each language on a daily basis than are translated by professional translators. Even when translated texts exist, they are often not available in a given application or for a given translation task. There is a particular shortage of parallel texts in areas undergoing recent developments. This leads to great interest in utilising comparable (i.e., less parallel) resources for translation tasks.
Informally, any collection of texts in two languages can be positioned on a cline from ‘fully parallel’ to ‘unrelated’ with several options in between:

noisy parallel texts: such texts introduce minor language-specific adaptations, e.g., an example for searching New York in the OpenOffice manual might be replaced with 北京 (‘Beijing’) in its Chinese translation;

strongly comparable texts: such texts are not designed to be translations, but they are still devoted to the same narrow topic and their producers are aware of a related text in a different language, e.g., interlinked Wikipedia articles or news items concerning exactly the same specific event in different languages;

weakly comparable texts: such texts are devoted to the same topic, but they are produced completely independently, e.g., English and Chinese textbooks on designing wind turbines, or parliamentary debates on health care from the Bundestag, the House of Commons and the Russian Duma.

This paper reports our experience from the TTC project (Translation, Terminology and Comparable Corpora; http://ttc-project.eu/), which is aimed at developing tools for (1) collecting comparable corpora from the Web, (2) ...
Similar resources
Ninth Workshop on Building and Using Comparable Corpora Workshop Programme
Comparable corpora are the most versatile and valuable resource for multilingual Natural Language Processing. The speaker will argue that comparable corpora can support a wider range of applications than has been demonstrated so far in the state of the art. The talk will present completed and ongoing work conducted by the speaker and colleagues from his research group where comparable corpora a...
A light way to collect comparable corpora from the Web
Statistical Machine Translation (SMT) relies on the availability of rich parallel corpora. However, in the case of under-resourced languages, parallel corpora are not readily available. To overcome this problem previous work has recognized the potential of using comparable corpora as training data. The process of obtaining such data usually involves (1) downloading a separate list of documents ...
Bootstrapping Translation Detection and Sentence Extraction from Comparable Corpora
Most work on extracting parallel text from comparable corpora depends on linguistic resources such as seed parallel documents or translation dictionaries. This paper presents a simple baseline approach for bootstrapping a parallel collection. It starts by observing documents published on similar dates and the cooccurrence of a small number of identical tokens across languages. It then uses fast...
Extracting a parallel corpus from comparable documents to improve translation quality in machine translation systems
Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...
ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora
The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles for the further advancement of automated translation. A possible solution is to exploit comparable corpora (non-parallel bi- or multi-lingual text resources) which are much more widely available than parallel translation data. Our presented toolkit deals with parallel content extract...